Intro

This project analyzes a sample of 1599 red wines described by 11 variables on their chemical properties. At least 3 wine experts rated the quality of each wine on a scale from 0 (worst) to 10 (perfect). The guiding question is which chemical properties affect the quality of red wine.

Starting from the data frame, I first give an overview of all variables and then explore the relationships and correlations between them. Along the way, I address the questions that these findings raise.

Load the Data

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
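A column named ‘X’ typically appears when a CSV file was saved with row names and then read back with read.csv. The sketch below reproduces that behavior on a tiny hypothetical file; it is an assumption that the wine file was written this way, and the names `sample_rows` and `csv_path` are illustrative only.

```r
# Hypothetical two-row sample; the values are the first rows of the str() output.
sample_rows <- data.frame(fixed.acidity = c(7.4, 7.8), quality = c(5L, 5L))

# write.csv() stores row names in an unnamed first column by default.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path)

# read.csv() gives that unnamed column the name "X".
reloaded <- read.csv(csv_path)
print(names(reloaded))

# Treating the first column as row names avoids the extra variable.
clean <- read.csv(csv_path, row.names = 1)
print(names(clean))
```

Reading with `row.names = 1` is an alternative to dropping the column after loading; both leave the same 12 variables.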

Analysis of columns’ names

The first column is named ‘X’. It has no bearing on this analysis, so I drop it.

# Remove the 'X' column using a logical index
myvar <- names(redwine) %in% c('X')
redwine <- redwine[!myvar]
colnames(redwine)
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

Univariate Plots Section

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

Univariate Plots

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

The graph shows that quality ranges from 3 to 8 and that 5 is the most common score. Scores of 5 and 6 cover the vast majority of observations, more than 1200 in total; high quality wines scoring 7 or 8 number just over 200; and low quality wines scoring 3 or 4 number fewer than 100.
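The group totals can be checked directly from the frequency table above. The following is a minimal sketch that rebuilds the quality vector from the printed counts; the labels "low", "medium", and "high" and the cut points are my own assumption about how to band the scores.

```r
# Counts copied from the table of quality scores 3 to 8 shown above.
counts <- c(10, 53, 681, 638, 199, 18)
quality <- rep(3:8, times = counts)

# Band the scores: 3-4 low, 5-6 medium, 7-8 high (illustrative cut points).
quality_band <- cut(quality, breaks = c(2, 4, 6, 8),
                    labels = c("low", "medium", "high"))

band_counts <- table(quality_band)
print(band_counts)

# Share of each band among all 1599 wines.
print(round(prop.table(band_counts), 3))
```

The medium band holds 1319 wines, the high band 217, and the low band 63, which matches the description of the graph.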

From the description of the attributes, I noticed that all features can be divided into 4 groups: an acid group, a substance group, a chemical group, and a measure group. I will now create plots to show the characteristics of each group.

1. Acid Group

# Count the observations whose value equals 0 in
# fixed.acidity, volatile.acidity, and citric.acid
table(redwine$fixed.acidity == 0)
## 
## FALSE 
##  1599
table(redwine$volatile.acidity == 0)
## 
## FALSE 
##  1599
table(redwine$citric.acid == 0)
## 
## FALSE  TRUE 
##  1467   132

The histograms show that fixed acidity and volatile acidity are both right skewed but otherwise close to a normal distribution. The tables above show that 132 samples have a citric acid value of 0.
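The three zero checks can also be done in one call. This is a small sketch on the first four rows from the str() output, not on the full data; the data frame name `acids` is illustrative.

```r
# First four observations of the acid variables, copied from the str() output.
acids <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  citric.acid      = c(0, 0, 0.04, 0.56)
)

# acids == 0 is a logical matrix; colSums() counts the TRUE values per column.
zero_counts <- colSums(acids == 0)
print(zero_counts)
```

On the full data frame the same call would be expected to report 132 zeros for citric.acid and none for the other two variables, in line with the tables above.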

2. Substance Group

## Warning: Removed 91 rows containing non-finite values (stat_bin).
## Warning: Removed 1 rows containing missing values (geom_bar).
## Warning: Removed 48 rows containing non-finite values (stat_bin).

The distributions of residual sugar and chlorides are clearly right skewed, so I limited the x axis, guided by the summary statistics in the previous section, to look more closely at the bulk of each histogram.

After limiting the x axis, the histograms of residual sugar and chlorides both look approximately normal, while alcohol still shows a roughly normal shape that is right skewed.
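The warnings above are what ggplot2 prints when axis limits drop rows that fall outside them. The sketch below uses simulated right-skewed data (not the wine data) to show how many values an upper limit removes and how a log10 transform is an alternative that keeps every row. The 95th percentile cutoff and the helper `skewness` are my own assumptions for illustration.

```r
set.seed(42)
# Simulated right-skewed variable standing in for residual sugar or chlorides.
x <- rlnorm(1000, meanlog = 0, sdlog = 1)

# Moment-based sample skewness; values near 0 indicate a symmetric shape.
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Option 1: cap the axis at the 95th percentile and count the dropped values.
cutoff <- quantile(x, 0.95)
n_dropped <- sum(x > cutoff)

# Option 2: keep all values and transform the scale instead.
skew_raw <- skewness(x)
skew_log <- skewness(log10(x))

cat("dropped by the limit:", n_dropped, "\n")
cat("skewness raw:", round(skew_raw, 2), " log10:", round(skew_log, 2), "\n")
```

Trimming makes the bulk of the histogram easier to read but hides the tail, while the log scale keeps the tail visible at the cost of a less intuitive axis.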

3. Chemical Group

All three plots contain some outliers. After excluding these outliers, the third plot in particular looks like a normal distribution; the original distributions of all three variables, however, are skewed to the right.

4. Measure Group

## $x
## [1] "Density"
## 
## attr(,"class")
## [1] "labels"
## $x
## [1] "pH"
## 
## attr(,"class")
## [1] "labels"

These two figures show that density and pH are both approximately normally distributed, and neither has any apparent outliers.

What is the structure of your dataset?

The summary of the data frame ‘redwine’ shows that it contains 1599 observations, each with 13 attributes as originally loaded. They are the following:

  • X
  • fixed.acidity
  • volatile.acidity
  • citric.acid
  • residual.sugar
  • chlorides
  • free.sulfur.dioxide
  • total.sulfur.dioxide
  • density
  • pH
  • sulphates
  • alcohol
  • quality

What is/are the main feature(s) of interest in your dataset?

The main feature of this dataset is quality, and I am interested in which variables affect quality and how. I have divided all variables into 4 groups, and I suspect there may be correlations within and between the groups, so I will first explore relationships between pairs of variables. After that, I will investigate further with multiple variables.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think density and pH might be factors that affect quality. The variables in the acid group should affect pH, much as the variables in the substance group should affect density. The variables in these two groups will therefore support my analysis of the feature of interest.

Did you create any new variables from existing variables in the dataset?

I think it is clearer to rename the variable ‘X’ to ‘id’. I also want to create two variables based on the variable descriptions: 1. total.acidity, which is the sum of fixed acidity and volatile acidity; 2. bound.sulfur.dioxide, which is the difference between total sulfur dioxide and free sulfur dioxide.

# Create two new columns: total.acidity and
# bound.sulfur.dioxide
redwine$total.acidity <- redwine$fixed.acidity + redwine$volatile.acidity
redwine$bound.sulfur.dioxide <- 
  redwine$total.sulfur.dioxide - redwine$free.sulfur.dioxide
# Keep a backup copy of the data frame
redwine2 <- redwine

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

After creating a histogram for each variable, I noticed that the distribution of citric acid is unusual compared to the others: it has many 0 values, which means that many red wine samples contain no citric acid. Most of the other variables are right skewed, and some are normally distributed. I only changed the name of the first column, because the data frame is already tidy and clean.

Bivariate Plots Section

First, I plot every variable (except id) against quality to look for the strongest positive and negative correlations.

Correlation Between Quality and Acid Group

These plots show that fixed.acidity and total.acidity have no obvious trend as quality changes, whereas volatile.acidity decreases as quality rises and citric.acid has a positive correlation with quality.

Correlation Between Quality and Substance Group

As the plots show, residual sugar and chlorides always stay at a low level, and neither shows an apparent correlation with quality, although chlorides may have a faint negative correlation with it. Alcohol, in contrast, appears to increase steadily from low quality to high quality.

Correlation Between Quality and Chemical Group

Sulphates has a clearly visible positive correlation with quality. Within a certain range, both bound sulfur dioxide and total sulfur dioxide decrease as quality increases, which indicates a negative correlation.

Correlation Between Quality and Measure Group

The first plot shows a negative correlation between density and the quality of red wine, and the second one also displays a decreasing trend as quality rises.

These plots show that the variables with a clear positive correlation with quality are:
- citric.acid
- sulphates
- alcohol

The variables with a clear negative correlation with quality are:
- volatile.acidity
- chlorides
- bound.sulfur.dioxide
- total.sulfur.dioxide
- density

Next, I compute the correlation coefficient for each of the variables listed above:

# First, restore the backup copy because the quality variable has
# been changed
redwine <- redwine2
# Create correlation coefficient for the variables which have a great positive 
# correlation with the quality
cor.test(redwine$citric.acid, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$citric.acid and redwine$quality
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725
cor.test(redwine$sulphates, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$sulphates and redwine$quality
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971
cor.test(redwine$alcohol, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$alcohol and redwine$quality
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
# Create correlation coefficient for the variables which have a great negative 
# correlation with the quality
cor.test(redwine$volatile.acidity, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$volatile.acidity and redwine$quality
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578
cor.test(redwine$chlorides, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$chlorides and redwine$quality
## t = -5.1948, df = 1597, p-value = 2.313e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.17681041 -0.08039344
## sample estimates:
##        cor 
## -0.1289066
cor.test(redwine$bound.sulfur.dioxide, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$bound.sulfur.dioxide and redwine$quality
## t = -8.3898, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2519465 -0.1580336
## sample estimates:
##       cor 
## -0.205463
cor.test(redwine$total.sulfur.dioxide, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$total.sulfur.dioxide and redwine$quality
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003
cor.test(redwine$density, redwine$quality, method = 'pearson')
## 
##  Pearson's product-moment correlation
## 
## data:  redwine$density and redwine$quality
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

From the results above, I extract the strongest positive and negative correlations. The largest positive correlation coefficient is 0.4761663, for alcohol, and the most negative one is -0.3905578, for volatile.acidity.
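The t values in the cor.test output follow from the correlation coefficient and the degrees of freedom through t = r * sqrt(df / (1 - r^2)). The sketch below recomputes them for the two strongest correlations, using only the numbers printed above; the variable names are illustrative.

```r
# Reported Pearson correlations with quality, copied from the output above.
r <- c(alcohol = 0.4761663, volatile.acidity = -0.3905578)

# Degrees of freedom reported by cor.test: n - 2 with n = 1599 observations.
dof <- 1599 - 2

# Test statistic for the null hypothesis that the true correlation is 0.
t_stat <- r * sqrt(dof / (1 - r^2))
print(round(t_stat, 2))
```

Both values agree with the reported statistics of 21.639 and -16.954 up to rounding. With 1597 degrees of freedom even a modest coefficient produces a very small p-value, so the size of the coefficient says more here than its significance.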

Sometimes two variables have an interesting relationship that I would not have predicted, and that is a good opportunity to explore them further. I therefore use ‘ggpairs’ to look for surprising correlations.

library(GGally)  # provides ggpairs()
ggpairs(redwine)

The plot shows that most of the large correlation coefficients belong to pairs of related variables, such as citric.acid and total.acidity. However, I also found two relationships that I did not expect: fixed.acidity with density, and alcohol with density.

Correlation Between Pairs of Variables of Interest

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  1. The correlation coefficient between fixed acidity and density is 0.67, which is positive and larger than 0.5. This means that the density of the wine rises noticeably as the amount of fixed acidity increases;
  2. Alcohol and density have a negative correlation with a coefficient of -0.5. This makes sense because alcohol is less dense than water, so the more alcohol a wine contains, the lower its overall density;
  3. I also noticed an unexpected pattern: pH rises as volatile acidity increases. This confuses me, because I would expect more acid to lower the pH rather than raise it.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The correlation between residual sugar and density is weaker than the one between alcohol and density: its coefficient is below 0.5, although I expected it to be larger. Likewise, the sulfur dioxide variables affect pH less than I expected, since I assumed they would lower the pH of red wine.

What was the strongest relationship you found?

The strongest relationship is between fixed.acidity and total.acidity, with a correlation coefficient of 0.995. This means that volatile acidity makes up only a small share of total acidity.
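A rough estimate of that share can be taken from the means in the summary table. This is a back-of-the-envelope sketch that uses the ratio of the two means, which is an approximation of my own rather than a per-wine calculation.

```r
# Means copied from the summary() output in the univariate section.
mean_fixed <- 8.32
mean_volatile <- 0.5278

# total.acidity was defined as fixed.acidity + volatile.acidity.
volatile_share <- mean_volatile / (mean_fixed + mean_volatile)
print(round(volatile_share, 3))
```

Volatile acidity contributes roughly 6% of total acidity on average, so total.acidity is almost entirely determined by fixed.acidity, which is consistent with a correlation close to 1.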

Multivariate Plots Section

Following the four groups defined earlier, some variables within the same group may affect the quality of red wine jointly: citric acid with volatile acidity, total.sulfur.dioxide with sulphates, and chlorides with alcohol.

Citric Acid With Volatile Acidity

This figure suggests that more citric acid and less volatile acidity go with higher quality red wine. Although the smooth lines give a clear trend for the points in each color, they are only fitted estimates and carry error.

Total Sulfur Dioxide With Sulphates

Apart from a few outliers, almost all high quality red wines have relatively more sulphates and less total sulfur dioxide. In other words, when total sulfur dioxide is low, the more sulphates a red wine has, the higher its quality.

Chlorides With Alcohol

This plot is similar to the previous one: high alcohol and low chlorides are associated with high quality red wines. This makes sense, since no one wants to taste salty red wine, and the alcohol content is relatively higher in high quality wines than in low quality ones.

There are also relationships between variables from different groups, such as chlorides with sulphates and fixed.acidity with residual.sugar.

Chlorides With Sulphates

This figure differs from the others because it includes the case where both variables are high. It shows that when chlorides and sulphates are both high, the quality of the red wine is quite low. High quality wines only exist at high sulphates and low chlorides.

Fixed Acidity With Residual Sugar

From the plot, it is hard to identify a pattern relating these two variables to quality. It looks as if, for a constant residual sugar, quality increases with fixed acidity until fixed acidity reaches about 9, then drops abruptly to the lowest quality, and afterwards rises slightly again over a short range. Overall, residual sugar stays at a very low level in red wine.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  1. Citric acid is closely correlated with both volatile acidity and fixed acidity, negatively with volatile acidity and positively with fixed acidity; in addition, a high citric acid content goes with higher quality red wine, possibly because it adds flavor.
  2. Although sulphates is only weakly correlated with total sulfur dioxide, a relatively higher sulphates content is associated with higher quality red wine.
  3. The relationship between alcohol and quality is the strongest positive correlation, with the largest coefficient of all variables.
  4. A high level of sulphates together with a low level of chlorides is a reliable indicator of a high quality red wine.

Were there any interesting or surprising interactions between features?

I am surprised that, of all variables, fixed.acidity has the strongest positive correlation with density. The correlation coefficients between chlorides and the other variables are small on average, but chlorides has a relatively close relationship with sulphates.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

# Build a linear model between alcohol and quality
m1 <- lm(I(alcohol) ~ I(quality), data = redwine)
# Add volatile.acidity to the model
m2 <- update(m1, ~.+volatile.acidity)
mtable(m1, m2)
## 
## Calls:
## m1: lm(formula = I(alcohol) ~ I(quality), data = redwine)
## m2: lm(formula = I(alcohol) ~ I(quality) + volatile.acidity, data = redwine)
## 
## ================================================
##                          m1            m2       
## ------------------------------------------------
##   (Intercept)           6.882***      6.998***  
##                        (0.165)       (0.220)    
##   I(quality)            0.628***      0.618***  
##                        (0.029)       (0.032)    
##   volatile.acidity                   -0.115     
##                                      (0.142)    
## ------------------------------------------------
##   R-squared             0.227         0.227     
##   adj. R-squared        0.226         0.226     
##   sigma                 0.937         0.937     
##   F                   468.267       234.406     
##   p                     0.000         0.000     
##   Log-likelihood    -2164.504     -2164.179     
##   Deviance           1403.295      1402.725     
##   AIC                4335.007      4336.358     
##   BIC                4351.139      4357.866     
##   N                  1599          1599         
## ================================================

This linear model has an R-squared of 0.227, and adding volatile.acidity does not raise it, so the fit is poor. The model is therefore of little use for my analysis, and it cannot capture the relationship between quality and the variables most correlated with it.
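For a linear model with a single predictor, R-squared equals the square of the Pearson correlation between the two variables, which explains why m1 reports 0.227. The sketch below checks this identity on simulated data (not the wine data) and then on the reported coefficient; the simulated intercept, slope, and noise are arbitrary assumptions.

```r
set.seed(1)
# Simulated stand-ins for the two variables used in m1.
quality <- sample(3:8, 200, replace = TRUE)
alcohol <- 7 + 0.6 * quality + rnorm(200)

# Simple linear regression with one predictor, as in m1.
fit <- lm(alcohol ~ quality)
r_squared <- summary(fit)$r.squared
r <- cor(alcohol, quality)

# The two quantities agree up to floating-point error.
print(isTRUE(all.equal(r_squared, r^2)))

# Squaring the reported correlation between alcohol and quality.
reported_r_squared <- round(0.4761663^2, 3)
print(reported_r_squared)
```

Because the squared correlation is symmetric in the two variables, swapping response and predictor in a one-predictor model would give the same R-squared, so a better fit would have to come from additional predictors.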


Final Plots and Summary

Plot One

Description One

This is a modified version of the first plot in this EDA. I chose it because quality is the main feature I explore, so I need to understand it fully before moving on to the next step. The figure shows that most wines score 5 or 6. Because the distribution of quality is approximately normal, the dataset is suitable for exploring the relationship between quality and the other variables.

Plot Two

Description Two

This is the most successful of all the plots. I swapped the x and y axes relative to the original because the linear relationship between the two variables is clearer that way. It shows that each quality level has its own regression line with a negative slope and a different intercept. Quality 8 has the largest intercept, which means that high quality red wine has a high citric acid value and a low volatile acidity.

Plot Three

Description Three

Alcohol is the variable most positively correlated with the quality of red wine in this dataset. This modified figure shows that the highest quality red wines fall in the range from 11% to 13% alcohol and contain only a low level of chlorides. As the amount of chlorides increases, the quality of red wine drops gradually.


Reflection

This was a long and complicated project for me because it was the first time I had to set the direction of the exploration myself. Although the dataset is not large, with just 1599 observations and 13 variables, a few outliers still interfered with my analysis when I created plots and built the model.

Exploring on my own gave me some insight into which variables correlate most strongly with quality. The largest positive correlation coefficients are 0.476 for alcohol and 0.251 for sulphates; the largest negative one is -0.391 for volatile.acidity. Even though these variables are closely related to quality, many other variables may also affect the quality of red wine, which would need to be confirmed by more complex models or analysis.

While exploring, I noticed that some variables belong to the same category, so I first divided them into 4 groups. I found that analyzing pairs of variables within the same group first, and then pairs across groups, made the whole process more reliable. The results of the two-variable exploration are specific and easy to understand: the trend in each plot is clear, and the correlation coefficient of each pair is easy to identify.

There are still some findings that I did not expect, such as the positive correlation between volatile acidity and pH and the rather weak correlation between residual sugar and density.

In the future, it would help to add more variables, such as the ingredients of the wine or the ambient temperature during production. It would also help to add text descriptions for some variables, such as their units and notes on the effect of low or high values.
